In [1]:

# Initialize Notebook
%run library/init.ipy
HTML('''<script> code_show=true;  function code_toggle() {  if (code_show){  $('div.input').hide();  } else {  $('div.input').show();  }  code_show = !code_show }  $( document ).ready(code_toggle); </script> <form action="javascript:code_toggle()"><input type="submit" value="Toggle Code"></form>''')

Out[1]:

p53-depletion by siRNA and LncRNA Analysis in WA-09 Cell Line | BioJupies¶

Introduction¶

This notebook contains an analysis of GEO dataset GSE76023 (https://www.ncbi.nlm.nih.gov/gds/?term=GSE76023) created using the BioJupies Generator.

Table of Contents¶

The notebook is divided into the following sections:

Load Dataset - Loads and previews the input dataset in the notebook environment.
PCA - Linear dimensionality reduction technique to visualize similarity between samples
Clustergrammer - Interactive hierarchical clustering heatmap visualization
Library Size Analysis - Analysis of readcount distribution for the samples within the dataset
Differential Expression Table - Differential expression analysis between two groups of samples
Volcano Plot - Plot the logFC and logP values resulting from a differential expression analysis
MA Plot - Plot the logFC and average expression values resulting from a differential expression analysis
Enrichr Links - Links to enrichment analysis results of the differentially expressed genes via Enrichr
Gene Ontology Enrichment Analysis - Identifies Gene Ontology terms which are enriched in the differentially expressed genes
Pathway Enrichment Analysis - Identifies biological pathways which are enriched in the differentially expressed genes
Transcription Factor Enrichment Analysis - Identifies transcription factors whose targets are enriched in the differentially expressed genes

Results¶

1. Load Dataset¶

Here, the GEO dataset GSE76023 is loaded into the notebook. Expression data was quantified as gene-level counts using the ARCHS⁴ pipeline (Lachmann et al., 2017), available at http://amp.pharm.mssm.edu/archs4/.

In [2]:

# Load dataset
dataset = load_dataset(source='archs4', gse='GSE76023', platform='GPL11154')

# Preview expression data
dataset['rawdata'].head()

Out[2]:

	GSM1972961	GSM1972956	GSM1972957	GSM1972954	GSM1972955	GSM1972958	GSM1972959	GSM1972960	GSM1972963	GSM1972962	GSM1972967	GSM1972966	GSM1972965	GSM1972964
A1BG	57	625	564	603	761	641	542	31	150	75	153	111	112	121
A1CF	1	48	36	45	49	73	15	1	10	4	18	18	21	20
A2M	5	122	78	159	167	917	510	3	130	116	37	49	16	12
A2ML1	223	952	355	846	1083	346	298	256	568	449	409	252	292	246
A2MP1	3	31	23	38	38	73	16	2	11	9	15	18	13	20

Table 1 | RNA-seq expression data. The table displays the first 5 rows of the quantified RNA-seq expression dataset. Rows represent genes, columns represent samples, and values show the number of mapped reads.

In [3]:

# Display metadata
display_metadata(dataset)

Sample_geo_accession	Sample Title	cell line	treatment
GSM1972961	hESC_siControl+RA_replicate2	H9	Control treated with Retinoic Acid (RA) for 4 ...
GSM1972956	hESC_shControl_replicate3	H9	Control
GSM1972957	hESC_shLncRNA_replicate1	H9	lncRNA Knock Down
GSM1972954	hESC_shControl_replicate1	H9	Control
GSM1972955	hESC_shControl_replicate2	H9	Control
GSM1972958	hESC_shLncRNA_replicate2	H9	lncRNA Knock Down
GSM1972959	hESC_shLncRNA_replicate3	H9	lncRNA Knock Down
GSM1972960	hESC_siControl+RA_replicate1	H9	Control treated with Retinoic Acid (RA) for 4 ...
GSM1972963	hESC_siControl_replicate2	H9	Control
GSM1972962	hESC_siControl_replicate1	H9	Control
GSM1972967	hESC_siTP53_replicate2	H9	TP53 Knock Down
GSM1972966	hESC_siTP53_replicate1	H9	TP53 Knock Down
GSM1972965	hESC_siTP53+RA_replicate2	H9	TP53 Knock Down treated with Retinoic Acid (RA...
GSM1972964	hESC_siTP53+RA_replicate1	H9	TP53 Knock Down treated with Retinoic Acid (RA...

Table 2 | Sample metadata. The table displays the metadata associated with the samples in the RNA-seq dataset. Rows represent RNA-seq samples, columns represent metadata categories.

In [4]:

# Configure signatures
dataset['signature_metadata'] = {
    'siControl vs siRNA': {
        'A': ['GSM1972954', 'GSM1972955', 'GSM1972956', 'GSM1972960', 'GSM1972961', 'GSM1972962', 'GSM1972963'],
        'B': ['GSM1972957', 'GSM1972958', 'GSM1972959', 'GSM1972964', 'GSM1972965', 'GSM1972966', 'GSM1972967']
    }
}

# Generate signatures
for label, groups in dataset['signature_metadata'].items():
    signatures[label] = generate_signature(group_A=groups['A'], group_B=groups['B'], method='limma', dataset=dataset)

2. PCA¶

Principal Component Analysis (PCA) is a statistical technique used to identify global patterns in high-dimensional datasets. It is commonly used to explore the similarity of biological samples in RNA-seq datasets. To achieve this, gene expression values are transformed into Principal Components (PCs), a set of linearly uncorrelated features which represent the most relevant sources of variance in the data, and subsequently visualized using a scatter plot.

In [5]:

# Run analysis
results['pca'] = analyze(dataset=dataset, tool='pca', nr_genes=2500, normalization='logCPM', z_score='True')

# Display results
plot(results['pca'])

Figure 1 | Principal Component Analysis results. The figure displays an interactive, three-dimensional scatter plot of the first three Principal Components (PCs) of the data. Each point represents an RNA-seq sample. Samples with similar gene expression profiles are closer in the three-dimensional space. If provided, sample groups are indicated using different colors, allowing for easier interpretation of the results.

3. Clustergrammer¶

Clustergrammer is a web-based tool for visualizing and analyzing high-dimensional data as interactive, hierarchically clustered heatmaps. It is commonly used to explore the similarity between samples in an RNA-seq dataset. In addition to identifying clusters of samples, it also allows users to identify the genes which contribute to the clustering.

In [6]:

# Run analysis
results['clustergrammer'] = analyze(dataset=dataset, tool='clustergrammer', nr_genes=2500, normalization='logCPM', z_score='True')

# Display results
plot(results['clustergrammer'])

Figure 2 | Clustergrammer analysis. The figure contains an interactive heatmap displaying gene expression for each sample in the RNA-seq dataset. Every row of the heatmap represents a gene, every column represents a sample, and every cell displays normalized gene expression values. The heatmap additionally features color bars beside each column which represent prior knowledge of each sample, such as the tissue of origin or experimental treatment.

4. Library Size Analysis¶

In order to quantify gene expression in an RNA-seq dataset, reads generated from the sequencing step are mapped to a reference genome and subsequently aggregated into numeric gene counts. Due to experimental variation and random technical noise, samples in an RNA-seq dataset often have variable amounts of total RNA. Library size analysis calculates and displays the total number of reads mapped for each sample in the RNA-seq dataset, facilitating the identification of outlying samples and the assessment of the overall quality of the data.

In [7]:

# Run analysis
results['library_size_analysis'] = analyze(dataset=dataset, tool='library_size_analysis')

# Display results
plot(results['library_size_analysis'])

Figure 3 | Library Size Analysis results. The figure contains an interactive bar chart which displays the total number of reads mapped to each RNA-seq sample in the dataset. Additional information for each sample is available by hovering over the bars. If provided, sample groups are indicated using different colors, thus allowing for easier interpretation of the results.

5. Differential Expression Table¶

Gene expression signatures are alterations in the patterns of gene expression that occur as a result of cellular perturbations such as drug treatments, gene knock-downs or diseases. They can be quantified using differential gene expression (DGE) methods, which compare gene expression between two groups of samples to identify genes whose expression is significantly altered in the perturbation. The signature table is used to interactively display the results of such analyses.

In [8]:

# Initialize results
results['signature_table'] = {}

# Loop through signatures
for label, signature in signatures.items():

    # Run analysis
    results['signature_table'][label] = analyze(signature=signature, tool='signature_table', signature_label=label)

    # Display results
    plot(results['signature_table'][label])

Gene	logFC	AveExpr	P-value	FDR
*BNIP3	-1.38	5.93	4.589204e-07	0.016171
*EGLN1	-1.39	4.38	2.229788e-06	0.039287
TEK	-1.45	4.14	6.830855e-06	0.070206
SYT11	1.60	3.69	7.969307e-06	0.070206
SMS	-1.20	8.04	1.547031e-05	0.105668
LRRIQ4	-2.85	-4.24	1.799223e-05	0.105668
ZNF732	-3.71	0.82	3.104663e-05	0.135925
CLDN1	-1.65	1.97	3.131367e-05	0.135925
ZNF680	-4.17	2.08	3.641696e-05	0.135925
A2ML1	-0.98	3.21	3.857353e-05	0.135925
CHMP1B2P	2.19	1.33	5.005413e-05	0.153510
DDIT4	-1.36	5.90	5.227664e-05	0.153510
MYOCD	-2.87	1.13	9.285398e-05	0.248501
ARHGAP40	-1.70	-0.86	9.872922e-05	0.248501
ANKRD37	-1.78	0.94	1.336163e-04	0.294369
NPPB	-4.01	-1.82	1.336598e-04	0.294369
LDHAL6FP	2.53	-4.06	1.498579e-04	0.310629
BEX5	-2.28	-0.48	1.701411e-04	0.311160
ZNF667	1.99	2.36	1.783920e-04	0.311160
SPATA31A5	3.96	-3.86	1.805070e-04	0.311160
FOXD1	-1.91	-0.72	1.900143e-04	0.311160
PGK1	-0.79	9.13	1.942655e-04	0.311160
S100A10	-1.26	5.18	2.132083e-04	0.312459
ALOXE3	-2.30	-2.34	2.203366e-04	0.312459
TCEAL5	1.52	-0.76	2.270673e-04	0.312459
ALX1	-1.72	0.42	2.391184e-04	0.312459
RP11-164C12.1	1.80	-5.18	2.400245e-04	0.312459
C17ORF51	-3.41	0.75	2.495581e-04	0.312459
RGS5	-1.87	4.40	2.571458e-04	0.312459
SLC27A6	-1.00	2.50	2.757710e-04	0.323921
ADCYAP1	3.01	-3.96	2.972895e-04	0.337932
CCDC105	-1.60	-5.73	3.166825e-04	0.348727
LDHA	-1.33	8.84	3.443217e-04	0.367673
C16ORF54	1.32	2.49	3.677980e-04	0.377844
FGFR1	0.55	9.39	3.752922e-04	0.377844
PRKX	-0.86	6.85	4.548420e-04	0.443738
RP11-478C6.5	-3.28	-3.45	4.732157e-04	0.443738
LAMB3	-1.83	-1.01	4.856862e-04	0.443738
RP11-452N17.1	-4.70	-3.90	4.928919e-04	0.443738
TES	-0.84	5.00	5.379305e-04	0.443738
FFAR3	-1.30	-5.85	5.548237e-04	0.443738
RP4-614C10.1	-1.45	-5.77	5.588798e-04	0.443738
OR9A3P	1.98	-4.50	5.596973e-04	0.443738
C11ORF45	-1.59	-0.68	5.759625e-04	0.443738
FBXW12	1.13	1.19	6.098803e-04	0.443738
FAM72C	-0.99	3.67	6.146733e-04	0.443738
SFXN3	-1.04	1.54	6.276078e-04	0.443738
LYPD6B	-1.19	0.90	6.465206e-04	0.443738
SSXP10	3.17	-3.24	6.702067e-04	0.443738
EPS8L2	-0.78	4.40	6.837556e-04	0.443738
THBS1	-1.55	6.44	6.848606e-04	0.443738
OTOP2	-2.69	-4.78	6.865404e-04	0.443738
KRT17	-2.97	-2.28	6.918509e-04	0.443738
RP11-195E2.1	2.86	-3.57	6.961920e-04	0.443738
TAGLN2	-0.91	5.86	7.062242e-04	0.443738
CAT	-2.82	1.80	7.138099e-04	0.443738
RPL12P30	1.85	-4.87	7.587010e-04	0.443738
COL11A2	-1.09	2.19	7.636107e-04	0.443738
GABBR2	-1.34	1.98	7.822274e-04	0.443738
OR14A16	-1.13	-5.96	8.129828e-04	0.443738
IL32	-1.54	1.14	8.382889e-04	0.443738
REPS2	-0.83	3.58	8.426418e-04	0.443738
C4ORF47	-1.64	0.47	8.445356e-04	0.443738
CH507-152C13.3	2.14	-4.15	8.450843e-04	0.443738
FGF11	-0.65	5.09	8.794874e-04	0.443738
PALLD	-1.08	5.37	8.799509e-04	0.443738
KRT9	-2.11	-5.34	9.053891e-04	0.443738
LYPD6	-0.74	3.54	9.079779e-04	0.443738
H1F0	-0.80	7.54	9.183412e-04	0.443738
FAAH	1.10	3.63	9.301756e-04	0.443738
SLC17A8	-2.31	-3.70	9.309911e-04	0.443738
PABPC4L	-1.47	1.87	9.544448e-04	0.443738
HK2	-0.67	6.75	9.691204e-04	0.443738
CER1	-2.09	-0.13	9.892299e-04	0.443738
SEC14L4	1.28	-0.38	9.957197e-04	0.443738
MST1P2	-1.11	-0.36	1.000185e-03	0.443738
AFAP1L2	-1.57	2.61	1.006632e-03	0.443738
RP11-475I24.9	2.07	-3.16	1.015847e-03	0.443738
MYL7	-2.29	-0.22	1.032060e-03	0.443738
FLVCR1	0.67	5.97	1.039717e-03	0.443738
P4HA1	-0.98	5.79	1.046038e-03	0.443738
DNM1	0.72	5.40	1.066345e-03	0.443738
RP11-366M4.11	-1.97	0.98	1.074233e-03	0.443738
AC012512.1	-1.75	-0.04	1.074879e-03	0.443738
IQCA1	-0.96	3.75	1.077756e-03	0.443738
FBLN5	-1.04	0.94	1.087867e-03	0.443738
BLCAP	0.52	5.37	1.095555e-03	0.443738
RP11-495P10.9	-2.61	-4.20	1.116928e-03	0.445183
GPX5	-2.04	-5.22	1.124392e-03	0.445183
ZNF785	0.64	3.71	1.144775e-03	0.448218
EN2	3.65	-3.42	1.217064e-03	0.467745
DUOXA1	-1.55	-1.69	1.230742e-03	0.467745
LRRC4	1.38	3.48	1.234752e-03	0.467745
CPNE4	-0.88	1.10	1.271898e-03	0.467745
STARD13	-0.73	2.25	1.280567e-03	0.467745
C9ORF64	-1.75	2.81	1.285778e-03	0.467745
RP5-1100H13.3	-1.53	-5.81	1.287568e-03	0.467745
RP5-877J2.1	-1.60	-5.67	1.305208e-03	0.469315
CHMP4C	-1.32	1.44	1.348082e-03	0.478209
SERPINB6	-0.68	4.53	1.357085e-03	0.478209

Table 3 | Differential Expression Table. The table is browsable and contains the gene expression signature generated from a differential gene expression analysis. Every row of the table represents a gene; the columns display the estimated measures of differential expression. Links to external resources containing additional information for each gene are also provided.

6. Volcano Plot¶

Volcano plots are a type of scatter plot commonly used to display the results of a differential gene expression analysis. They can be used to quickly identify genes whose expression is significantly altered in a perturbation, and to assess the global similarity of gene expression in two groups of biological samples. Each point in the scatter plot represents a gene; the axes display the significance versus fold-change estimated by the differential expression analysis.

In [9]:

# Initialize results
results['volcano_plot'] = {}

# Loop through signatures
for label, signature in signatures.items():

    # Run analysis
    results['volcano_plot'][label] = analyze(signature=signature, tool='volcano_plot', signature_label=label, pvalue_threshold=0.05, logfc_threshold=1.5)

    # Display results
    plot(results['volcano_plot'][label])

Figure 4 | Volcano Plot. The figure contains an interactive scatter plot which displays the log2-fold changes and statistical significance of each gene calculated by performing a differential gene expression analysis. Every point in the plot represents a gene. Red points indicate significantly up-regulated genes, blue points indicate down-regulated genes. Additional information for each gene is available by hovering over it.
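As a rough illustration of how the thresholds passed to the volcano plot cell (`pvalue_threshold=0.05`, `logfc_threshold=1.5`) could split genes into up-regulated, down-regulated, and non-significant groups, consider the sketch below. The function name `classify_gene` is hypothetical. Whether BioJupies applies the cut-off to raw or adjusted P-values, and whether it uses strict inequalities, are assumptions of this sketch. The example values are copied from Table 3.

```python
def classify_gene(log_fc, p_value, pvalue_threshold=0.05, logfc_threshold=1.5):
    """Label a gene by combining a significance and a fold-change cut-off.

    Assumption: the P-value passed in is the one the plot thresholds on.
    """
    if p_value < pvalue_threshold:
        if log_fc >= logfc_threshold:
            return "up"
        if log_fc <= -logfc_threshold:
            return "down"
    return "not significant"


# (logFC, P-value) pairs taken from Table 3
table3_rows = {
    "SYT11": (1.60, 7.969307e-06),
    "ZNF732": (-3.71, 3.104663e-05),
    "BNIP3": (-1.38, 4.589204e-07),
    "FGFR1": (0.55, 3.752922e-04),
}

labels = {gene: classify_gene(fc, p) for gene, (fc, p) in table3_rows.items()}
```

Under these assumptions, BNIP3 is not highlighted despite its very small P-value, because its absolute logFC falls below 1.5.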

7. MA Plot¶

MA plots are a type of scatter plot commonly used to display the results of a differential gene expression analysis. They can be used to quickly identify genes whose expression is significantly altered in a perturbation, and to assess the global similarity of gene expression in two groups of biological samples. Each point in the scatter plot represents a gene; the axes display the average gene expression versus fold-change estimated by the differential expression analysis.

In [10]:

# Initialize results
results['ma_plot'] = {}

# Loop through signatures
for label, signature in signatures.items():

    # Run analysis
    results['ma_plot'][label] = analyze(signature=signature, tool='ma_plot', signature_label=label, pvalue_threshold=0.05, logfc_threshold=1.5)

    # Display results
    plot(results['ma_plot'][label])

Figure 5 | MA Plot. The figure contains an interactive scatter plot which displays the average expression and log2-fold change of each gene calculated by performing differential gene expression analysis. Every point in the plot represents a gene. Red points indicate significantly up-regulated genes, blue points indicate down-regulated genes. Additional information for each gene is available by hovering over it.

8. Enrichr Links¶

Enrichment analysis is a statistical procedure used to identify biological terms which are over-represented in a given gene set. These include signaling pathways, molecular functions, diseases, and a wide variety of other biological terms obtained by integrating prior knowledge of gene function from multiple resources. Enrichr is a web-based application which allows users to perform enrichment analysis using a large collection of gene-set libraries and various interactive approaches to display enrichment results.
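To make the idea of over-representation concrete, the sketch below computes a one-sided hypergeometric P-value for the overlap between a query gene set and the genes annotated to a term. This is a common textbook formulation and is shown here only as an illustration; the notebook does not state which statistic Enrichr reports, and the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_term, n_query, n_overlap):
    """P(overlap >= n_overlap) when n_query genes are drawn at random
    from n_background genes, n_term of which are annotated to the term."""
    max_overlap = min(n_term, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 annotated to the term,
# 4 genes in the query set, 3 of which carry the annotation.
p_value = hypergeom_enrichment_p(n_background=20, n_term=5, n_query=4, n_overlap=3)
```

A small P-value means an overlap at least this large would rarely arise by chance. With many terms tested at once, the P-values would then need multiple-testing correction.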

In [11]:

# Initialize results
results['enrichr'] = {}

# Loop through signatures
for label, signature in signatures.items():

    # Run analysis
    results['enrichr'][label] = analyze(signature=signature, tool='enrichr', signature_label=label, geneset_size=500)

    # Display results
    plot(results['enrichr'][label])

siControl vs siRNA Signature:¶

Upregulated: https://amp.pharm.mssm.edu/Enrichr/enrich?dataset=3sei9
Downregulated: https://amp.pharm.mssm.edu/Enrichr/enrich?dataset=3seia

Table 4 | Enrichr links. The table displays links to Enrichr containing the results of enrichment analyses generated by analyzing the up-regulated and down-regulated genes from a differential expression analysis. By clicking on these links, users can interactively explore and download the enrichment results from the Enrichr website.

9. Gene Ontology Enrichment Analysis¶

Gene Ontology (GO) is a major bioinformatics initiative aimed at unifying the representation of gene attributes across all species. It contains a large collection of experimentally validated and predicted associations between genes and biological terms. This information can be leveraged by Enrichr to identify the biological processes, molecular functions and cellular components which are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples.

In [12]:

# Initialize results
results['go_enrichment'] = {}

# Loop through results
for label, enrichr_results in results['enrichr'].items():

    # Run analysis
    results['go_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='go_enrichment', signature_label=label)

    # Display results
    plot(results['go_enrichment'][label])

Figure 6 | Gene Ontology Enrichment Analysis Results. The figure contains interactive bar charts displaying the results of the Gene Ontology enrichment analysis generated using Enrichr. The x axis indicates the enrichment score for each term. Significant terms are highlighted in bold. Additional information about enrichment results is available by hovering over each bar.

10. Pathway Enrichment Analysis¶

Biological pathways are sequences of interactions between biochemical compounds which play a key role in determining cellular behavior. Databases such as KEGG, Reactome and WikiPathways contain a large number of associations between such pathways and genes. This information can be leveraged by Enrichr to identify the biological pathways which are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples.

In [13]:

# Initialize results
results['pathway_enrichment'] = {}

# Loop through results
for label, enrichr_results in results['enrichr'].items():

    # Run analysis
    results['pathway_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='pathway_enrichment', signature_label=label)

    # Display results
    plot(results['pathway_enrichment'][label])

Figure 7 | Pathway Enrichment Analysis Results. The figure contains interactive bar charts displaying the results of the pathway enrichment analysis generated using Enrichr. The x axis indicates the enrichment score for each term. Significant terms are highlighted in bold. Additional information about enrichment results is available by hovering over each bar.

11. Transcription Factor Enrichment Analysis¶

Transcription Factors (TFs) are proteins involved in the transcriptional regulation of gene expression. Databases such as ChEA and ENCODE contain a large number of associations between TFs and their transcriptional targets. This information can be leveraged by Enrichr to identify the transcription factors whose targets are over-represented in the up-regulated and down-regulated genes identified by comparing two groups of samples.

In [14]:

# Initialize results
results['tf_enrichment'] = {}

# Loop through results
for label, enrichr_results in results['enrichr'].items():

    # Run analysis
    results['tf_enrichment'][label] = analyze(enrichr_results=enrichr_results['results'], tool='tf_enrichment', signature_label=label)

    # Display results
    plot(results['tf_enrichment'][label])

A. ChEA (experimentally validated targets)¶

Rank	Transcription Factor	P-value	FDR	Target
1	SUZ12	0.000488	0.305163	47 downregulated targets
2	BMI1	0.001177	0.367937	43 downregulated targets
3	EZH2	0.002960	0.616589	36 downregulated targets
4	CBX2	0.020278	1.000000	31 upregulated targets
5	JARID2	0.022361	1.000000	39 upregulated targets
6	EED	0.044348	1.000000	29 upregulated targets
7	TP53	0.052473	1.000000	37 upregulated targets
8	PHC1	0.058499	1.000000	31 upregulated targets
9	RNF2	0.106732	1.000000	33 upregulated targets
10	POU5F1	0.215899	1.000000	19 upregulated targets
11	ERG	0.283907	1.000000	10 upregulated targets
12	TP63	0.291228	1.000000	6 upregulated targets
13	CIITA	0.436027	1.000000	2 upregulated targets
14	TAL1	0.452968	1.000000	2 upregulated targets
15	TCF21	0.608439	1.000000	1 upregulated targets
16	HOXA2	0.618246	1.000000	1 upregulated targets
17	TRIM28	0.661864	1.000000	2 upregulated targets
18	IKZF1	0.747824	1.000000	3 upregulated targets
19	RING1B	0.748248	1.000000	46 upregulated targets
20	CTNNB1	0.774014	1.000000	3 upregulated targets
21	SOX9	0.802680	1.000000	1 upregulated targets
22	IRF8	0.812134	1.000000	6 upregulated targets
23	PRDM16	0.812825	1.000000	2 upregulated targets
24	MYCN	0.839949	1.000000	4 upregulated targets
25	KLF4	0.851418	1.000000	32 upregulated targets
26	TAF15	0.868596	1.000000	4 upregulated targets
27	WT1	0.877679	1.000000	3 upregulated targets
28	BCAT	0.891861	1.000000	12 upregulated targets
29	GATA2	0.892782	1.000000	1 upregulated targets
30	ZFP322A	0.898099	1.000000	1 upregulated targets
31	ESR1	0.916862	1.000000	1 upregulated targets
32	VDR	0.918984	1.000000	8 upregulated targets
33	KLF2	0.926795	1.000000	1 upregulated targets
34	KLF5	0.926795	1.000000	1 upregulated targets
35	SALL1	0.930428	1.000000	1 upregulated targets
36	IRF1	0.935010	1.000000	5 upregulated targets
37	CEBPD	0.938331	1.000000	8 upregulated targets
38	ZNF652	0.941784	1.000000	1 upregulated targets
39	TEAD4	0.942005	1.000000	9 upregulated targets
40	AR	0.950859	1.000000	2 upregulated targets
41	TCF4	0.951373	1.000000	7 upregulated targets
42	HOXD13	0.958520	1.000000	2 upregulated targets
43	IGF1R	0.962245	1.000000	1 upregulated targets
44	FUS	0.963771	1.000000	8 upregulated targets
45	DMRT1	0.965024	1.000000	1 upregulated targets
46	ZNF274	0.965072	1.000000	4 upregulated targets
47	ELF1	0.965904	1.000000	1 upregulated targets
48	NR4A2	0.967209	1.000000	2 upregulated targets
49	STAT1	0.968255	1.000000	10 upregulated targets
50	PCGF2	0.971024	1.000000	6 upregulated targets

B. ENCODE (experimentally validated targets)¶

Rank	Transcription Factor	P-value	FDR	Target
1	SUZ12*	8.104805e-07	0.000624	35 upregulated targets
2	FOSL1	2.913719e-01	1.000000	11 downregulated targets
3	GABPA	3.811643e-01	1.000000	4 downregulated targets
4	EP300	4.452247e-01	1.000000	5 downregulated targets
5	STAT5A	4.594903e-01	1.000000	6 downregulated targets
6	JUN	4.997884e-01	1.000000	13 downregulated targets
7	ZEB1	5.076277e-01	1.000000	5 downregulated targets
8	NR2F2	5.082270e-01	1.000000	4 downregulated targets
9	MAFK	5.354835e-01	1.000000	5 downregulated targets
10	TCF7L2	6.160310e-01	1.000000	13 downregulated targets
11	JUND	6.239895e-01	1.000000	6 downregulated targets
12	REST	6.363081e-01	1.000000	16 downregulated targets
13	TCF12	6.567392e-01	1.000000	20 downregulated targets
14	RXRA	6.789487e-01	1.000000	3 downregulated targets
15	NR3C1	7.096907e-01	1.000000	5 downregulated targets
16	FOSL2	7.186941e-01	1.000000	7 downregulated targets
17	ESR1	7.263139e-01	1.000000	14 downregulated targets
18	CBX2	7.482475e-01	1.000000	46 downregulated targets
19	EZH2	7.482475e-01	1.000000	46 downregulated targets
20	CEBPB	7.722236e-01	1.000000	6 downregulated targets
21	NANOG	7.778610e-01	1.000000	2 downregulated targets
22	ESRRA	7.860925e-01	1.000000	2 downregulated targets
23	BCL11A	7.940548e-01	1.000000	2 downregulated targets
24	FOXM1	7.946520e-01	1.000000	3 downregulated targets
25	NELFE	8.443775e-01	1.000000	6 downregulated targets
26	FOXA2	8.719878e-01	1.000000	6 downregulated targets
27	HSF1	8.725893e-01	1.000000	4 downregulated targets
28	ZKSCAN1	8.764772e-01	1.000000	4 downregulated targets
29	ZC3H11A	8.810919e-01	1.000000	15 downregulated targets
30	TEAD4	8.922340e-01	1.000000	2 downregulated targets
31	CBX3	8.972620e-01	1.000000	3 downregulated targets
32	NFIC	9.453749e-01	1.000000	11 downregulated targets
33	GATA3	9.456574e-01	1.000000	22 downregulated targets
34	CBX8	9.471758e-01	1.000000	40 downregulated targets
35	POLR3A	9.548442e-01	1.000000	2 downregulated targets
36	HDAC2	9.556503e-01	1.000000	3 downregulated targets
37	RFX5	9.566407e-01	1.000000	7 downregulated targets
38	RAD21	9.573528e-01	1.000000	19 downregulated targets
39	MEF2A	9.576790e-01	1.000000	4 downregulated targets
40	RELA	9.587878e-01	1.000000	13 downregulated targets
41	HNF4G	9.684542e-01	1.000000	9 downregulated targets
42	CTCF	9.758291e-01	1.000000	21 downregulated targets
43	STAT3	9.779912e-01	1.000000	37 downregulated targets
44	STAT1	9.800504e-01	1.000000	15 downregulated targets
45	GATA2	9.822087e-01	1.000000	5 downregulated targets
46	FOS	9.836004e-01	1.000000	13 downregulated targets
47	BATF	9.842673e-01	1.000000	11 downregulated targets
48	BCL3	9.875384e-01	1.000000	6 downregulated targets
49	NFE2	9.943888e-01	1.000000	2 downregulated targets
50	ZZZ3	9.951004e-01	1.000000	2 downregulated targets

C. ARCHS4 (coexpressed genes)¶

Rank	Transcription Factor	P-value	FDR	Target
1	GATA6*	1.541629e-07	0.000207	25 downregulated targets
2	GRHL3*	2.261381e-05	0.010131	21 downregulated targets
3	ZNF750*	2.261381e-05	0.010131	21 downregulated targets
4	POU3F4*	6.974144e-06	0.010182	22 upregulated targets
5	OVOL1*	6.975492e-05	0.023438	20 downregulated targets
6	TP63*	2.043108e-04	0.039228	19 downregulated targets
7	DSP*	2.043108e-04	0.039228	19 downregulated targets
8	FOXN1*	2.043108e-04	0.039228	19 downregulated targets
9	BMP2	5.671016e-04	0.084687	18 downregulated targets
10	IRF6	5.671016e-04	0.084687	18 downregulated targets
11	EN2	2.043108e-04	0.149147	19 upregulated targets
12	GSC	1.488534e-03	0.181872	17 downregulated targets
13	SOX17	1.488534e-03	0.181872	17 downregulated targets
14	GRHL1	3.686336e-03	0.353888	16 downregulated targets
15	PLEK2	3.686336e-03	0.353888	16 downregulated targets
16	EHF	3.686336e-03	0.353888	16 downregulated targets
17	TBX22	1.488534e-03	0.538205	17 upregulated targets
18	SOX1	1.488534e-03	0.538205	17 upregulated targets
19	SOX30	3.686336e-03	0.538205	16 upregulated targets
20	ASCL1	3.686336e-03	0.538205	16 upregulated targets
21	LHX1	3.686336e-03	0.538205	16 upregulated targets
22	SP8	3.686336e-03	0.538205	16 upregulated targets
23	POU3F2	3.686336e-03	0.538205	16 upregulated targets
24	PAX3	3.686336e-03	0.538205	16 upregulated targets
25	LHX2	8.592411e-03	0.570224	15 upregulated targets
26	NEUROG3	8.592411e-03	0.570224	15 upregulated targets
27	HES5	8.592411e-03	0.570224	15 upregulated targets
28	FOXA2	8.592411e-03	0.570224	15 upregulated targets
29	POU4F1	8.592411e-03	0.570224	15 upregulated targets
30	OTX1	8.592411e-03	0.570224	15 upregulated targets
31	IRX2	8.592411e-03	0.570224	15 upregulated targets
32	NHLH2	8.592411e-03	0.570224	15 upregulated targets
33	DACH1	8.592411e-03	0.570224	15 upregulated targets
34	NEUROD4	8.592411e-03	0.570224	15 upregulated targets
35	NR2F1	8.592411e-03	0.570224	15 upregulated targets
36	POU3F3	8.592411e-03	0.570224	15 upregulated targets
37	FOXD4L1	8.592411e-03	0.721763	15 downregulated targets
38	BCL6B	8.592411e-03	0.721763	15 downregulated targets
39	ALX3	1.880198e-02	0.946582	14 upregulated targets
40	ZNF479	1.880198e-02	0.946582	14 upregulated targets
41	ZNF618	1.880198e-02	0.946582	14 upregulated targets
42	OTP	1.880198e-02	0.946582	14 upregulated targets
43	INSM1	1.880198e-02	0.946582	14 upregulated targets
44	SOX21	1.880198e-02	0.946582	14 upregulated targets
45	IRX1	1.880198e-02	0.946582	14 upregulated targets
46	OTOP3	1.880198e-02	0.999513	14 downregulated targets
47	HNF1B	1.880198e-02	0.999513	14 downregulated targets
48	PPP1R13L	1.880198e-02	0.999513	14 downregulated targets
49	EVX1	3.852079e-02	0.999513	13 upregulated targets
50	RHOXF1	3.852079e-02	0.999513	13 upregulated targets

Table 5 | Transcription Factor Enrichment Analysis Results. The scrollable tables display the results of the Transcription Factor (TF) enrichment analysis generated using Enrichr. Every row represents a TF; significant TFs are highlighted in bold. A and B display results generated using the ChEA and ENCODE libraries, indicating TFs whose experimentally validated targets are enriched. C displays results generated using the ARCHS4 library, indicating TFs whose top coexpressed genes (according to the ARCHS4 dataset) are enriched.

Methods¶

Data¶

Data Source¶

Raw RNA-seq data for GEO dataset GSE76023 was downloaded from the SRA database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76023) and quantified to gene-level counts using the ARCHS4 pipeline (Lachmann et al., 2017). Gene counts were downloaded from the ARCHS4 gene expression matrix v1.1. For more information about ARCHS4, as well as free access to the quantified gene expression matrix, visit the project home page at the following URL: http://amp.pharm.mssm.edu/archs4/download.html.

Data Normalization¶

logCPM¶

Raw counts were normalized to log10-Counts Per Million (logCPM) by dividing each column by the total sum of its counts, multiplying it by 10⁶, followed by the application of a log10-transform.
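The normalization described above can be sketched in a few lines of plain Python. The function name `log_cpm` and the toy counts are hypothetical. The `pseudocount` added before the log10-transform is an assumption of this sketch (the text does not say how zero counts are handled), included only so that genes with zero reads do not produce `log10(0)`.

```python
import math


def log_cpm(counts, pseudocount=1.0):
    """Convert {sample: {gene: raw_count}} into log10 counts per million.

    pseudocount is an assumption of this sketch; it avoids log10(0).
    """
    normalized = {}
    for sample, gene_counts in counts.items():
        library_size = sum(gene_counts.values())  # total counts in this column
        normalized[sample] = {
            gene: math.log10(count / library_size * 1e6 + pseudocount)
            for gene, count in gene_counts.items()
        }
    return normalized


# Toy data, not taken from the dataset
toy_counts = {
    "sample_1": {"GENE_A": 10, "GENE_B": 90, "GENE_C": 0},
    "sample_2": {"GENE_A": 500, "GENE_B": 250, "GENE_C": 250},
}
toy_logcpm = log_cpm(toy_counts)
```

Because each column is divided by its own total, samples sequenced to different depths become comparable before any downstream analysis.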

Signature Generation¶

The gene expression signature was generated by comparing gene expression levels between the control group and the experimental group using the limma R package (Ritchie et al., Nucleic Acids Research 2015), available on Bioconductor: http://bioconductor.org/packages/release/bioc/html/limma.html.

PCA¶

Principal Component Analysis was performed using the PCA function from the sklearn Python module. Prior to performing PCA, the raw gene counts were normalized using the logCPM method, filtered by selecting the 2500 genes with the most variable expression, and finally transformed using the Z-score method.
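The two pre-processing steps that precede PCA (variance filtering and Z-scoring) might look like the sketch below, written with the standard library only. The function names and toy values are hypothetical. The use of population variance and the choice to Z-score each gene across samples are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev, pvariance


def top_variable_genes(expr, n_genes):
    """expr maps gene -> list of normalized values (one per sample).
    Return the names of the n_genes genes with the largest variance."""
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:n_genes]


def z_score_rows(expr):
    """Center each gene to mean 0 and scale it to unit standard deviation."""
    scaled = {}
    for gene, values in expr.items():
        mu, sigma = mean(values), pstdev(values)
        # Constant genes have sigma == 0; map them to zeros instead of dividing by zero.
        scaled[gene] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return scaled


# Toy normalized expression values, not taken from the dataset
toy_expr = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [5.0, 5.0, 5.0],
    "GENE_C": [0.0, 4.0, 8.0],
}
selected = top_variable_genes(toy_expr, n_genes=2)
toy_z = z_score_rows({gene: toy_expr[gene] for gene in selected})
```

In the notebook the filter keeps 2500 genes rather than 2. The resulting matrix, with samples as rows, is what would then be handed to a PCA routine such as `sklearn.decomposition.PCA`.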

Clustergrammer¶

The interactive heatmap was generated using Clustergrammer (Fernandez et al., 2017), which is freely available at http://amp.pharm.mssm.edu/clustergrammer/. Prior to displaying the heatmap, the raw gene counts were normalized using the logCPM method, filtered by selecting the 2500 genes with the most variable expression, and finally transformed using the Z-score method.

Library Size Analysis¶

Read counts were calculated by summing each column of the raw gene count matrix. Total counts were subsequently divided by 10⁶ and displayed as millions of reads.
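A minimal sketch of this calculation follows. The function name `library_sizes` is hypothetical, and the input uses only the five genes previewed in Table 1 for three of the samples, so the totals are far smaller than the real library sizes and serve purely to show the arithmetic.

```python
def library_sizes(counts_by_sample):
    """Sum the raw counts of each sample and express the total in millions of reads."""
    return {sample: sum(counts) / 1e6 for sample, counts in counts_by_sample.items()}


# Counts for A1BG, A1CF, A2M, A2ML1, A2MP1 as shown in Table 1 (preview rows only)
preview_counts = {
    "GSM1972961": [57, 1, 5, 223, 3],
    "GSM1972956": [625, 48, 122, 952, 31],
    "GSM1972957": [564, 36, 78, 355, 23],
}
sizes = library_sizes(preview_counts)

# Samples ordered from the smallest to the largest total, which helps spot outliers
ordered = sorted(sizes, key=sizes.get)
```

Even in this five-gene preview, GSM1972961 has a much lower total than the other two samples, which is the kind of pattern the bar chart is meant to expose.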

Differential Expression Table¶

The gene expression signature was generated by performing differential gene expression analysis using the methods described in the Signature Generation section.

Volcano Plot¶

Gene fold changes were transformed using log2 and displayed on the x axis; P-values were corrected using the Benjamini-Hochberg method, transformed using –log10, and displayed on the y axis. See the Signature Generation section for more information on the methods used to generate these values.
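The Benjamini-Hochberg adjustment and the –log10 transform can be sketched as follows. The function names and the toy inputs are hypothetical; this is a plain-Python rendering of the standard step-up procedure, not the code used by limma or BioJupies.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def volcano_coordinates(log2_fc, p_values):
    """Pair each log2 fold change (x) with -log10 of its adjusted P-value (y)."""
    adjusted = benjamini_hochberg(p_values)
    return [(fc, -math.log10(q)) for fc, q in zip(log2_fc, adjusted)]


# Toy inputs, not taken from the dataset
toy_p = [0.01, 0.04, 0.03, 0.005]
toy_fc = [1.2, -0.4, 0.8, -2.1]
toy_adjusted = benjamini_hochberg(toy_p)
toy_points = volcano_coordinates(toy_fc, toy_p)
```

An adjusted P-value of exactly zero would make `math.log10` fail, so a real implementation would need to clip or floor such values first.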

MA Plot¶

Average gene expression was calculated as the mean of the normalized gene expression values and displayed on the x axis; gene fold changes were transformed using log2 and displayed on the y axis. P-values were corrected using the Benjamini-Hochberg method. For more information on the methods used to generate the signature, see the Signature Generation section.

Enrichr Links¶

The up-regulated and down-regulated gene sets were generated by extracting the 500 genes with the highest and lowest values, respectively, from the gene expression signature. The gene sets were subsequently submitted to Enrichr (Kuleshov et al., 2016), which is freely available at http://amp.pharm.mssm.edu/Enrichr/, using the gene set upload API. For more information on the methods used to generate the signature, see the Signature Generation section.
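One way to picture the gene set extraction is the sketch below. The function name `split_signature` is hypothetical, and ranking by logFC is an assumption: the text only says "values from the gene expression signature" without naming the column. The logFC values are copied from Table 3, and `n_genes=2` stands in for the 500 used in the notebook.

```python
def split_signature(log_fc_by_gene, n_genes):
    """Return (up, down): the n_genes genes with the highest and lowest values.

    Assumption: the ranking value is the logFC column of the signature.
    """
    if n_genes <= 0:
        return [], []
    ranked = sorted(log_fc_by_gene, key=log_fc_by_gene.get, reverse=True)
    up = ranked[:n_genes]
    down = ranked[-n_genes:][::-1]  # most negative value first
    return up, down


# logFC values taken from Table 3
table3_logfc = {
    "SYT11": 1.60,
    "CHMP1B2P": 2.19,
    "SPATA31A5": 3.96,
    "ZNF680": -4.17,
    "NPPB": -4.01,
    "BNIP3": -1.38,
}
up_genes, down_genes = split_signature(table3_logfc, n_genes=2)
```

If the signature holds fewer than twice `n_genes` genes, the two lists will overlap, so a real implementation may want to check the signature size first.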

Gene Ontology Enrichment Analysis¶

Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: GO_Biological_Process_2017b, GO_Molecular_Function_2017b, GO_Cellular_Component_2017b. Significant terms are determined by using a cut-off of p-value < 0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr Links section.

Pathway Enrichment Analysis¶

Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: KEGG_2016, Reactome_2016, WikiPathways_2016. Significant terms are determined by using a cut-off of p-value < 0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr Links section.

Transcription Factor Enrichment Analysis¶

Enrichment results were generated by analyzing the up-regulated and down-regulated gene sets using Enrichr. The following libraries were used for the analysis: ChEA_2016, ENCODE_TF_ChIP-seq_2015, ARCHS4_TFs_Coexp. Significant results are determined by using a cut-off of p-value < 0.1 after applying Benjamini-Hochberg correction. For more information on the methods used to perform the enrichment analysis, see the Enrichr Links section.

References¶

Fernandez, N.F., Gundersen, G.W., Rahman, A., Grimes, M.L., Rikova, K., Hornbeck, P., and Ma'ayan, A. (2017). Clustergrammer, a web-based heatmap visualization and analysis tool for high-dimensional biological data. Scientific Data 4, 170151. doi: http://dx.doi.org/10.1038/sdata.2017.151

Kuleshov, M.V., Jones, M.R., Rouillard, A.D., Fernandez, N.F., Duan, Q., Wang, Z., Koplev, S., Jenkins, S.L., Jagodnik, K.M., Lachmann, A., et al. (2016). Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research 44, W90–W97. doi: https://dx.doi.org/10.1093/nar/gkw377

Lachmann, A., Torre, D., Keenan, A.B., Jagodnik, K.M., Lee, H.J., Silverstein, M.C., Wang, L., and Ma’ayan, A. (2017). Massive Mining of Publicly Available RNA-seq Data from Human and Mouse (Cold Spring Harbor Laboratory). doi: https://doi.org/10.1101/189092

Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572. doi: https://doi.org/10.1080/14786440109462720

Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47–e47. doi: https://doi.org/10.1093/nar/gkv007

The Jupyter Notebook Generator is being developed by the Ma'ayan Lab at the Icahn School of Medicine at Mount Sinai
and is an open source project available on GitHub.